Dynamic Miss-Counting Algorithms: Finding Implication and Similarity Rules with Confidence Pruning

Authors

  • Shinji Fujiwara
  • Rajeev Motwani
Abstract

Dynamic Miss-Counting (DMC) algorithms are proposed, which find all implication and similarity rules with confidence pruning but without support pruning. To handle data sets with a large number of columns, we propose dynamic pruning techniques that can be applied during data scanning. DMC counts the number of rows in which each pair of columns disagrees instead of counting the number of hits. DMC deletes a candidate as soon as the number of misses exceeds the maximum number of misses allowed for that pair. We also propose several optimization techniques that reduce the required memory size significantly. We evaluated our algorithms on four data sets: Web access logs, a Web page-link graph, news documents, and a dictionary. These data sets have between 74,000 and 700,000 items. Experiments show that DMC can find high-confidence rules for such large data sets efficiently.
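The miss-counting idea in the abstract can be illustrated with a minimal Python sketch for implication rules only. This is not the authors' implementation: the function name `dmc_implications`, the two-pass layout, and the bound `floor((1 - min_conf) * ones[i])` on the allowed misses for a rule `i => j` are assumptions made for illustration, and the memory optimizations mentioned in the abstract are not modeled.

```python
from collections import Counter
from fractions import Fraction
from math import floor


def dmc_implications(rows, min_conf):
    """Return {(i, j): confidence} for column pairs where i => j holds.

    Hypothetical sketch: a rule i => j "misses" on a row that contains i
    but not j. A candidate is dropped as soon as its misses exceed the
    maximum allowed for column i.
    """
    rows = [frozenset(r) for r in rows]
    # Pass 1: number of rows containing each column.
    ones = Counter(c for r in rows for c in r)
    # Exact arithmetic avoids float rounding in the miss bound (assumed formula).
    slack = 1 - Fraction(str(min_conf))
    max_miss = {c: floor(slack * n) for c, n in ones.items()}

    seen = Counter()                # rows containing i scanned so far
    cand = {c: {} for c in ones}    # cand[i][j] = misses of i => j so far

    # Pass 2: scan rows, adding and pruning candidates dynamically.
    for row in rows:
        for i in row:
            misses = cand[i]
            # A pair first seen now has already missed seen[i] times,
            # so it is only worth adding while seen[i] is within the bound.
            if seen[i] <= max_miss[i]:
                for j in row:
                    if j != i and j not in misses:
                        misses[j] = seen[i]
            # Count a miss for every surviving candidate j absent from this row.
            for j in [j for j in misses if j not in row]:
                misses[j] += 1
                if misses[j] > max_miss[i]:
                    del misses[j]   # prune immediately
            seen[i] += 1

    return {(i, j): 1 - m / ones[i]
            for i, ms in cand.items() for j, m in ms.items()}


# Toy data: each row is the set of columns that are 1 in that row.
toy_rows = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
rules = dmc_implications(toy_rows, 0.75)
```

In this sketch a pruned pair can never be re-added, because pruning implies that column `i` has already been seen more often than its miss bound allows.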


Similar resources

Dynamic Miss-Counting Algorithms: Finding Implication and Similarity Rules with Confidence Pruning

Dynamic Miss-Counting algorithms are proposed, which find all implication and similarity rules with confidence pruning but without support pruning. To handle data sets with a large number of columns, we propose dynamic pruning techniques that can be applied during data scanning. DMC counts the number of rows in which each pair of columns disagrees instead of counting the number of hits. DMC dele...


Axiomatization of frequent itemsets

Mining association rules is very popular in the data mining community. Most algorithms designed for finding association rules start with searching for frequent itemsets. Typically, in these algorithms, counting phases and pruning phases are interleaved. In the counting phase, partial information about the frequencies of selected itemsets is gathered. In the pruning phase as much as possible of ...


Confidence Measures for Multimodal Identity

Multimodal fusion for identity verification has already shown great improvement compared to unimodal algorithms. In this paper, we propose to integrate confidence measures during the fusion process. We present a comparison of three different methods to generate such confidence information from unimodal identity verification systems. These methods can be used either to enhance the performance of a mu...


On Tighter Inequalities for Efficient Similarity Search in Metric Spaces

Similarity search consists of the efficient retrieval of relevant information satisfying user formulated query conditions from a database with prebuilt indexing structures. Since the evaluation of the distance functions between queries and indexed objects is often computationally expensive, there have been many attempts to build indexing structures that use as few distance computations as possi...


A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-words association pattern is an expression such as (TATA, 30, AGGAGGT) → C that expresses a rule that if a text contains a subword TATA followed by another subword AGGAGGT with distance no more than 30 letters then a property C will hold with high probability. Th...



Publication date: 1999